Abstract: In this paper, we document our efforts in participating in the TREC 2009 Entity
Ranking and Web Tracks. We had multiple aims: For the Web Track’s Adhoc task we experiment
with document text and anchor text representation, and the use of the link structure. For the
Web Track’s Diversity task we experiment with using a top-down sliding window that, given the
top ranked documents, chooses as the next ranked document the one that has the most unique
terms or links. We test our sliding window method on a standard document text index and an
index of propagated anchor texts. We also experiment with extreme query expansions by taking
the top n results of the initial ranking as multi-faceted aspects of the topic to construct n
relevance models to obtain n sets of results. A final diverse set of results is obtained by
merging the n result lists. For the Entity Ranking Track, we also explore the effectiveness of
the anchor text representation, look at the co-citation graph, and experiment with using
Wikipedia as a pivot. Our main findings can be summarized as follows: Anchor text is very
effective for diversity. It gives high early precision and the results cover more relevant
sub-topics than the document text index. Our baseline runs have low diversity, which limits
the possible impact of the sliding window approach. New link information seems more effective
for diversifying text-based search results than the number of unique terms added by a
document. In the entity ranking task, anchor text finds few primary pages, but it does
retrieve a large number of relevant pages. Using Wikipedia as a pivot results in large gains
of P10 and NDCG when only primary pages are considered. Although the links between the
Wikipedia entities and pages in the ClueWeb collection are sparse, the precision of the
existing links is very high.


1  Introduction 

Modern Web search requires the combination of traditional topical relevance with other
features such as authority, recency, or diversity. In practice combining indicators of these
different features is hard: features may be sparse, have different strengths, or have
radically different score distributions. This can easily lead to disappointing results with
straightforward combination methods, even if the features are inherently useful. We propose a
new 'sliding window' approach that allows for combining relevance with another feature. Given
an initial ranked list, we use a sliding window of n documents, where the window size controls
the relative importance of the original relevance ranking. Of the documents within the window,
we select the document with the highest score on the new feature, and then slide the window
down the ranking. Assume we have an indicator of diversity and set n = 10: the first ranked
document will then be the most diverse from the top 10 of the original ranking, after which we
add the 11th ranked document to the window and again select the most diverse one, and so on.
The approach is robust in the sense that i) the relevance ranking is used as a basis and is
guaranteed to be broadly respected, and ii) the exact scores of the feature are treated
independently of the relevance scores, thereby avoiding unfortunate effects in the
combination.

For the Adhoc Task, we made a number of runs using the document-text and propagated
anchor-texts. We also aimed for multi-faceted results by using the top 10 retrieved pages as
different aspects of the topic. For each aspect a separate relevance model is created, and the
resulting runs are merged into a final ranking having a more diverse set of results. For our
Diversity Task experiments we apply the above sliding window approach to different ad hoc
runs. We assume that the diversity topics are fairly broad, with hundreds or thousands of
relevant documents. The initial ranked list will have very high precision in the first hundred
or hundreds of results, and we opt to conservatively re-rank them using a window size of 10.
Specifically, we look at two new features: a link filter and a term filter. Documents
co-citing or co-cited by the same set of documents are topically related and contain similar
content. Our assumption is that a document with many unseen links contains unseen information
about the topic, thereby diversifying the results. Hence, we select the document that
introduces the most unseen links to the results so far. Alternatively, we filter or re-rank
the results list based on term overlap. By boosting documents that contain many terms that do
not occur in the results seen so far, we aim to maximize the amount of new information added
to the top of the ranked list.

[Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188). Report date:
November 2009. Title: Result Diversity and Entity Ranking Experiments: Anchors, Links, Text
and Wikipedia. Performing organization: University of Amsterdam, Intelligent Systems Lab
Amsterdam (ISLA), Science Park 107, 1098 XG Amsterdam, The Netherlands. Distribution: approved
for public release; distribution unlimited. Supplementary notes: Proceedings of the Eighteenth
Text REtrieval Conference (TREC 2009) held in Gaithersburg, Maryland, November 17-20, 2009.
The conference was co-sponsored by the National Institute of Standards and Technology (NIST),
the Defense Advanced Research Projects Agency (DARPA) and the Advanced Research and
Development Activity (ARDA). Security classification: unclassified. Number of pages: 10.]

Entity ranking on the Web is a difficult task with many pitfalls. Before any entities can be
ranked, they first have to be recognized as entities and classified into the correct entity
type. Our hypothesis is that effective entity ranking on the web can be achieved by exploiting
the available structured information to make sense of the great amount of unstructured web
information. We propose to use Wikipedia to avoid the problem of entity recognition, and to
simplify the entity type classification. Wikipedia is an excellent resource for entity ranking
because of its elaborate category structure. The TREC entity ranking track investigates the
problem of related entity finding, where entity types are limited to people, organizations and
products. The people, organization and product entity types can easily be mapped to Wikipedia
categories. Successful methods for entity ranking in Wikipedia have been explored in the
entity ranking task that has run since 2007 at INEX (Initiative for the Evaluation of XML
retrieval) [3, 4]. We investigate the relations between the TREC and the INEX entity ranking
tasks, and try to carry over methods that have proven effective at the Wikipedia task. To
retrieve web pages outside of Wikipedia we make use of link information, in particular the
external links already present on the Wikipedia pages. The effectiveness of the Wikipedia
pivot approach is compared to the effectiveness of two other retrieval methods: using a
propagated anchor-text index, and using co-citation information.

The rest of this paper is organized as follows. In Section 2, we describe the experimental
set-up. In Section 3, we discuss our experiments for the Web Track, and in Section 4 our
Entity Ranking experiments. Finally, we summarize our findings in Section 5.


2  Experimental  Set-up 

For  both  the  Entity  Ranking  and  Web  Tracks  we  only  used 
the  category  B  of  the  ClueWeb  collection,  and  Indri  [6]  for 
indexing.  Stopwords  are  removed  and  terms  are  stemmed 
using  the  Krovetz  stemmer.  We  built  the  following  indexes: 

Text: contains document text of all documents in ClueWeb category B.

Anchor: contains the anchor text of all documents in ClueWeb category B. All anchors are
combined in a bag of words. 37,882,935 documents (75.43% of all documents) have anchor text
and therefore at least one incoming link.

Web only: contains document text of all non-Wikipedia documents in ClueWeb category B. This
consists of all documents in part en0000 to en0011.

Wikipedia only: contains document text of all Wikipedia documents in ClueWeb category B. This
consists of all documents in part enwp00 to enwp03.


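For all runs, we use Jelinek-Mercer smoothing, which is implemented in Indri as follows:

$$P(r|D) = (1 - \lambda) \cdot \frac{tf_{r,D}}{|D|} + \lambda \cdot P(r|C) \qquad (1)$$

where $D$ is a document in collection $C$, $tf_{r,D}$ is the frequency of term $r$ in $D$, and
$|D|$ is the length of $D$. We use little smoothing (λ = 0.15), which was found to be very
effective for large collections [7, 8].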
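To make the smoothing concrete, here is a minimal Python sketch of a Jelinek-Mercer estimate. It assumes the usual form, a weight of 1 - λ on the document's relative term frequency and λ on the collection model. The function name and the toy counts are illustrative only; this is not Indri's code.

```python
from collections import Counter


def jelinek_mercer(term, doc_tokens, collection_tf, collection_len, lam=0.15):
    """Smoothed P(term | document) as a mix of document and collection models.

    Assumption: lam is the weight of the collection model, so a small lam
    (such as 0.15) means little smoothing.
    """
    tf = Counter(doc_tokens)[term]
    p_doc = tf / len(doc_tokens) if doc_tokens else 0.0
    p_col = collection_tf.get(term, 0) / collection_len
    return (1.0 - lam) * p_doc + lam * p_col


# Toy example: a four-token document and a 100-token collection.
doc = ["a", "b", "a", "c"]
coll_tf = {"a": 10, "b": 5, "c": 5, "d": 80}
p_seen = jelinek_mercer("a", doc, coll_tf, 100)    # term occurs in the document
p_unseen = jelinek_mercer("d", doc, coll_tf, 100)  # term only in the collection
```

A term that is absent from the document still receives a non-zero probability through the collection model, which is the purpose of the smoothing.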

For ad hoc search, pages with more text have a higher prior probability of being relevant [9].
Because some web pages have very little textual content, we use a linear document length prior
(β = 1). That is, the score of each retrieved document is multiplied by $P(d)$:

$$P(d) = \frac{|d|^{\beta}}{\sum_{d' \in C} |d'|^{\beta}} \qquad (2)$$


Using a length prior on the anchor text representation of documents has an interesting effect,
as the length of the anchor text is correlated to the incoming link degree of a page. The
anchor text of a link typically consists of one or a few words. The more links a page
receives, the more anchor text it has. Therefore, the length prior on the anchor text index
promotes web pages that have a large number of incoming links and thus the more important
pages.
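The effect of the length prior can be sketched in a few lines of Python. The sketch assumes that the prior is proportional to the document length raised to the power β and normalised over the collection, and that retrieval scores are plain (non-log) probabilities. The function name and the toy numbers are hypothetical.

```python
def apply_length_prior(scores, lengths, beta=1.0):
    """Multiply each retrieval score by a length prior |d|^beta / sum |d'|^beta.

    Assumptions: `scores` maps document ids to non-log scores, and `lengths`
    maps every document in the collection to its length in tokens.
    """
    total = sum(length ** beta for length in lengths.values())
    return {doc: score * (lengths[doc] ** beta) / total
            for doc, score in scores.items()}


# Two documents with equal text scores but different lengths.
scores = {"short": 0.2, "long": 0.2}
lengths = {"short": 100, "long": 300}
with_prior = apply_length_prior(scores, lengths, beta=1.0)
without_prior = apply_length_prior(scores, lengths, beta=0.0)
```

With β = 1 the longer document moves ahead of the shorter one. With β = 0 the prior is uniform and the ranking is unchanged. On an anchor text index the "length" is the amount of anchor text, so the same computation favours pages with many incoming links.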


3  Web  Track 

We submitted runs for both the Adhoc and Diversity Tasks. We experiment with using the anchor
text of web pages as an alternative document representation. The effectiveness of anchor text
for locating relevant entry pages is well established [2, 9] but for ad hoc search it seems
less useful [5, 8]. Given the fairly large coverage of the anchors (more than 75% of the
documents in the collection have at least one incoming link) and the high density of the link
graph (we extracted over 1.5 billion collection-internal links), the anchor index could give
high early precision, which is required for the Diversity task. As anchor text provides a
document representation that is disjoint from the document text, documents that have very
similar anchor text might have more dissimilar document text. This could be useful for
generating a diverse results list.

The ClueWeb collection also contains a snapshot of the English Wikipedia, which is very
different in nature from the rest of the World Wide Web. We want to directly compare the
results from Wikipedia against results from the rest of the Web. Because Wikipedia has
encyclopedic articles on single topics, it plausibly has lower redundancy of information than
the Web. This might have a significant impact on the diversity of retrieved Wikipedia pages,
as each page should have unique content and a list of Wikipedia pages should naturally be
diverse.

3.1  Diversifying  Retrieval  Results 

We  use  two  methods  to  diversify  search  results.  The  first 
method  post-processes  the  initial  ranked  list  using  a  top 
down  filter  and  the  second  method  is  an  extreme  form  of 
relevance  feedback. 

3.1.1  Filtering  using  sliding  windows 

To make results in the ranked list more diverse, we experiment with a top-down filtering
method using a sliding window of n documents. We keep the highest ranked result as is and
choose from the next n documents the one that maximises diversity according to some diversity
indicator. We then slide the window down one step in the list and repeat the process. All
official runs use a window of size n = 10. This filter allows us to easily test the utility of
different document features before spending a lot of time finding the proper way to combine
the most effective features. Because the filter is relatively conservative (documents can move
up at most n - 1 = 9 ranks), the initial relevance ranking is broadly respected and we prevent
low ranked off-topic documents with extreme scores on some feature from infiltrating the
higher ranks. If a certain feature is not useful for a certain task (in this case diversity),
the sliding window approach guarantees its impact will be small. If we find an effective
feature, we can easily make its impact bigger by increasing the size of the sliding window. As
diversity indicators we use the number of new terms or new links introduced by the next
document.

Term Filter (TF): Term overlap is often used to measure document similarity. We use the
inverse of this idea to achieve diversity. Given the highest ranked documents, the next
document should add new terms to those the user has already seen in higher ranked documents.
From the documents within the sliding window we choose the one that has the most new terms to
optimise diversity. A side effect of this feature is that it favours long documents, as they
tend to contain more distinct terms.

Link Filter (LF): Another measure of document similarity is co-citation coupling, which is
used in citation analysis. The more citations two documents have in common, the more similar
their subject matter. We use the same approach as with the term-based filter and choose from
the documents in the sliding window the one that has the most new incoming or outgoing links.
With incoming links we measure how often a document is cited by others that do not cite
documents higher in the ranking. With outgoing links we measure how often a document cites web
pages that are not cited by documents higher in the ranking. A side effect of using incoming
links is that it favours documents with a high indegree, which are typically entry pages of
sites or popular pages. A side effect of using outgoing links is that it favours documents
with a high outdegree, which are typically list pages or index pages.
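The sliding window filter with either indicator can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the runs. The function name `sliding_window_rerank` and the `features` mapping (document id to a set of terms, or to a set of link sources or targets) are hypothetical, and ties are broken in favour of the higher original rank.

```python
def sliding_window_rerank(ranking, features, n=10):
    """Re-rank `ranking` top-down with a sliding window of n documents.

    The top result is kept as is. At each following position the window holds
    the n highest-ranked documents not yet selected, and the one that adds the
    most unseen features (terms or links) is selected.
    """
    if not ranking:
        return []
    remaining = list(ranking)
    first = remaining.pop(0)
    reranked = [first]
    seen = set(features.get(first, ()))
    while remaining:
        window = remaining[:n]
        # max() returns the first maximum, so ties keep the original order.
        best = max(window, key=lambda d: len(set(features.get(d, ())) - seen))
        remaining.remove(best)
        reranked.append(best)
        seen |= set(features.get(best, ()))
    return reranked


# Toy example with term sets and a window of two documents.
ranking = ["d1", "d2", "d3", "d4"]
terms = {"d1": {"a", "b"}, "d2": {"a", "b"}, "d3": {"a", "c", "d"}, "d4": {"e"}}
filtered = sliding_window_rerank(ranking, terms, n=2)
```

In the example, d2 repeats the terms of d1 and is passed over until the end, while d3 and d4 each move up one rank. Passing sets of outgoing link targets or incoming link sources instead of term sets gives the link filter, and a window of n = 1 returns the original ranking.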

3.1.2  Merging  results  from  multiple  relevance  models 

Another method is to use the top n documents as n different aspects of the search topic, and
use them for relevance feedback to obtain diverse expanded queries. For each document a
separate relevance model is created to obtain n result lists, which are then merged into a
final ranking. Assuming that each document will give a different relevance model, each query
will represent the overall topic in a slightly different context. Our submitted runs use
n = 10 documents.
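The merging step can be sketched as below. The text does not say how a document that appears in several result lists is scored, so this sketch assumes that it keeps its highest retrieval score across the lists. That choice, the function name and the toy scores are assumptions for illustration.

```python
def merge_runs(runs, depth=1000):
    """Merge several result lists into one ranking by retrieval score.

    `runs` is a list of result lists, each a list of (doc_id, score) pairs.
    Assumption: a document retrieved by several expanded queries keeps its
    maximum score; the paper does not specify this detail.
    """
    best = {}
    for run in runs:
        for doc, score in run:
            if doc not in best or score > best[doc]:
                best[doc] = score
    # Sort by descending score, then by id for a deterministic order.
    merged = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return merged[:depth]


# Two toy result lists from two different relevance models.
run_a = [("d1", 0.9), ("d2", 0.5)]
run_b = [("d3", 0.8), ("d1", 0.4)]
merged = merge_runs([run_a, run_b])
```

Scores from different expanded queries are only comparable if they are on a common scale, so in practice some normalisation per run may be needed before merging.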

3.2  Official  Runs 

We  submitted  two  runs  for  the  Adhoc  Task: 

UamsAw7an3: mixture of text and anchor text runs.
$s_{mix}(d) = \lambda \cdot s_{text}(d) + (1 - \lambda) \cdot s_{anchor}(d)$ with
$\lambda = 0.7$

UamsAwebQE10: full ClueWeb text index. 10 different relevance models are constructed, one from
each document in the top 10 results. The results retrieved using the 10 relevance models are
merged into a final ranking based on their retrieval scores.
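The score mixture of the UamsAw7an3 run listed above is a weighted sum of the text score and the anchor text score of a document, with a weight of 0.7 on the text score. A minimal sketch follows. It assumes that both runs produce non-negative scores on a comparable scale and that a document missing from one run contributes a score of 0 for that run; neither detail is given in the text.

```python
def mix_scores(text_scores, anchor_scores, lam=0.7):
    """Linear mixture lam * text + (1 - lam) * anchor, sorted by mixed score.

    Assumption: a document absent from one of the runs gets score 0.0 there.
    """
    docs = set(text_scores) | set(anchor_scores)
    mixed = {
        d: lam * text_scores.get(d, 0.0) + (1.0 - lam) * anchor_scores.get(d, 0.0)
        for d in docs
    }
    return sorted(mixed.items(), key=lambda item: (-item[1], item[0]))


# Toy scores: d2 is retrieved by both runs, d1 and d3 by one run each.
text_run = {"d1": 1.0, "d2": 0.5}
anchor_run = {"d2": 1.0, "d3": 1.0}
mixed_ranking = mix_scores(text_run, anchor_run)
```

If the underlying scores are log probabilities, as language-model retrieval scores often are, a default of 0 for missing documents would be wrong, and the scores would need to be normalised first.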

We  submitted  three  runs  for  the  Diversity  Task: 

UamsDancTFb1: Anchor text index run with length prior β = 1, term filter applied with n = 10.

UamsDwebLFout: Text index run with length prior β = 1, link filter applied using all outgoing
links and n = 10.

UamsDwebQE10TF: Text index run with length prior β = 1, each result in the top 10 is used as a
separate document for query expansion. Final run is a merge of 10 runs using different
relevance models.

3.3  Results 

We  will  first  discuss  results  of  our  baseline  runs  to  show  the 
relative  effectiveness  of  the  various  indexes. 


3.3.1  Baseline  results 

For the Adhoc Task we report the official statMAP measure and statMPC@30 in Table 1. Clearly,
the length prior has a big impact on performance. On the text index, both early and overall
precision increase when the length prior is used. On the anchor text index, the overall
precision drops slightly when using the length prior, but the early precision vastly improves.
Because most documents in the collection have no or only a few incoming links, the anchor text
of these documents is poor. Thus, the anchor text run will miss many of the relevant
documents, as is reflected by the low average precision. Although we expected the Anchor run
to do well on early precision, the estimated P@30 of 0.5558 seems very high when compared to
similar Anchor-only runs on the TREC Terabyte tracks [7, 8], where the scores for P@10 and MAP
are usually well below those of a full-text run. A possible explanation might be found when
considering the way relevance is estimated. If most runs contributing to the assessment pool
use a similar document representation, a single run using a very different document
representation might retrieve a very different set of documents in the top ranks, which have a
low sampling probability. A document with a low sampling probability that is judged relevant
represents many estimated relevant documents and can result in per-topic precision scores
above 1.0 for runs that rank these documents highly, thereby boosting the overall scores
significantly. As mentioned before, the document representation of the anchor texts will be
very different from the full text representation, and hence result in a very different ranked
list. Plausibly, the anchor text model ranks certain relevant documents highly that have a low
sampling probability, resulting in an estimated precision well above 1, as is the case for our
anchor text run. The high statMPC@30 might be an over-estimation. We removed the pool
inclusion probability column from the official prels and used standard trec_eval to see if the
traditional P@30 measure gives similar results (Table 2) and found that the anchor text run
has a much lower P@30 than the full-text run. On the other hand, the Anchor run has a much
higher Mean Reciprocal Rank (MRR) than the Text run. The MRR can never be over-estimated, as
it simply looks at the rank of the first retrieved relevant document. The precision at rank 30
could be under-estimated if the number of judged results is low.
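A toy calculation shows how a sampled estimate of precision can exceed 1.0. The sketch below is not the official statMAP or statMPC estimator. It is a simplified inverse-probability-weighted estimate in which each judged relevant document counts as 1/p documents, where p is its (assumed known) probability of being sampled for judgement. All names and numbers are hypothetical.

```python
def estimated_precision_at_k(ranking, judgments, k=30):
    """Simplified inverse-probability estimate of precision at rank k.

    `judgments` maps a judged document to (is_relevant, inclusion_probability).
    Unjudged documents contribute nothing. This only illustrates the effect
    described in the text; it is not the official TREC estimator.
    """
    estimated_relevant = 0.0
    for doc in ranking[:k]:
        if doc in judgments:
            is_relevant, prob = judgments[doc]
            if is_relevant:
                estimated_relevant += 1.0 / prob
    return estimated_relevant / k


# One relevant document with a low sampling probability dominates the estimate.
ranking = ["a", "b", "c", "d", "e"]
judgments = {"a": (True, 0.1), "b": (False, 0.5)}
estimate = estimated_precision_at_k(ranking, judgments, k=5)
```

Here a single relevant document sampled with probability 0.1 stands in for ten relevant documents, so the estimated precision at rank 5 comes out at 2.0.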

The Anchor run has many more non-judged documents in the top ranks than the Text run. At rank
5, less than 69% of the results of the Anchor run are judged and at rank 30, less than 29% are
judged, while for the Text run, over 89% of the results at rank 5 are judged and over 68% at
rank 30. This is a strong indication that the traditional P@30 score of the Anchor run is
underestimated. Together with the much higher MRR of the Anchor run and the lower number of
judged results in the top ranks, this supports the high statMPC@30.
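The three traditional quantities compared here can be computed per topic as in the sketch below. It assumes binary qrels given as a mapping from judged document ids to 0 or 1, with unjudged documents absent from the mapping and counted as non-relevant for precision, which is the usual trec_eval convention. The function names are illustrative.

```python
def precision_at_k(ranking, qrels, k):
    """Fraction of the top k that is judged relevant (unjudged = non-relevant)."""
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k


def reciprocal_rank(ranking, qrels):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if qrels.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


def judged_fraction(ranking, qrels, k):
    """Fraction of the top k that has a relevance judgement at all."""
    return sum(1 for d in ranking[:k] if d in qrels) / k


# Toy topic: only the first two results were judged, one of them relevant.
ranking = ["a", "b", "c", "d"]
qrels = {"a": 0, "b": 1}
p_at_4 = precision_at_k(ranking, qrels, 4)
rr = reciprocal_rank(ranking, qrels)
judged_at_4 = judged_fraction(ranking, qrels, 4)
```

In the example half of the top 4 is unjudged, so the precision of 0.25 is a lower bound on the true precision, while the reciprocal rank of 0.5 can only increase if an unjudged document turns out to be relevant. Averaging the per-topic values over all topics gives P@k, MRR and the judged percentage.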

The Web only index gives much better results than the Wikipedia index. This is to be expected,
as the Web only index has many more documents and also arguably more redundant information.
But as both sub-collections have relevant documents, the combined index contains more relevant
documents and is therefore even more effective.

Table 1: Results for the 2009 Adhoc Task. Best scores are in bold-face.

Run                      statMAP             statMPC@30
                         β = 0     β = 1     β = 0     β = 1
Text                     0.0991    0.1442    0.2208    0.3079
Anchor                   0.0676    0.0567    0.2010    0.5558
0.7 Text + 0.3 Anchor    0.1244    0.1687    0.2952    0.4812
Web only                 0.0880    0.1044    0.2181    0.2528
Wikipedia                0.0483    0.0748    0.1946    0.2433

Table 2: Ad hoc results using traditional measures and binary judgements

Run              P@30      MRR       % Judged in
                                     Top 5     Top 30
Text β = 1       0.2827    0.3061    89.20     68.33
Anchor β = 1     0.1607    0.6335    68.80     28.20

Table 3: Impact of length prior on Diversity performance of baseline runs. Best scores are in
bold-face.

Run                      α-nDCG@10           IA-P@10
                         β = 0     β = 1     β = 0     β = 1
Text                     0.094     0.120     0.038     0.054
Anchor                   0.178     0.257     0.054     0.082
0.7 Text + 0.3 Anchor    0.156     0.223     0.066     0.083
Web only                 0.081     0.094     0.032     0.040
Wikipedia                0.065     0.124     0.037     0.071

For the Diversity Task we report the official α-nDCG@10 and IA-P@10 measures in Table 3.
Again, we see that the length prior has a big positive impact on the diversity scores of the
baseline runs. Given its impact on the Adhoc scores, this is not surprising. The runs with the
length prior have more relevant documents in the top ranks and thus have more documents that
contribute to the diversity measures as well. The anchor text run scores much higher for the
diversity measures than the full-text run, in line with the Adhoc results. Although we
explained why the observed high early precision score for the Adhoc Task might be an
over-estimation, these Diversity results, which are based on different pools and different
relevance judgements, indicate that the anchor text run really has more relevant documents in
the top ranks.

When we look at the performance of the Web only and Wikipedia runs, we see that the length
prior again improves the ranking. Recall that on the Adhoc measures, the Wikipedia run was
less effective than the Web only run, with and without length prior. However, for the
Diversity Task, the Wikipedia run scores higher on both reported measures. This could mean
that the Wikipedia results are more precise, or that it is easier to find relevant pages in
the relatively homogeneous and spam-free Wikipedia than in the much larger Web. This will be
discussed further in Section 3.3.2.

3.3.2  Diversifying  methods 

Finally, we show the impact of the diversity specific methods in Table 4. Runs filtered on
distinct terms are denoted with TF(n), where n is the size of the sliding window. Runs
filtered on distinct links are denoted with LF(d, n), where d is the direction of the links
(incoming or outgoing) and n is the size of the sliding window. We use RF(10) to denote a run
merged from the 10 relevance feedback runs.

If method A scores better on a diversity measure than method B, it does not necessarily mean
it has a more diverse ranking. The higher score could simply be the result of a better
relevance ranking. To see if differences observed in the scores of the diversity measures are
caused by a better relevance ranking or a more diverse ranking, we present standard document
ranking measures as well. We compare α-nDCG@10 with standard nDCG@10 and IA-P@10 with P@10.
For this, we mapped the Diversity qrels to standard TREC Adhoc qrels by assuming a document is
relevant for a topic if it is relevant for at least one sub-topic.
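The mapping from Diversity qrels to Adhoc qrels amounts to taking, per topic and document, the maximum judgement over the sub-topics. A minimal sketch, assuming the judgements are available as (topic, sub-topic, document, relevance) tuples; the tuple layout and the function name are assumptions for illustration.

```python
def diversity_to_adhoc_qrels(diversity_qrels):
    """Collapse sub-topic judgements into binary topic-level judgements.

    A document is relevant for a topic if it is relevant for at least one of
    the topic's sub-topics. Input rows are (topic, subtopic, doc, rel).
    """
    adhoc = {}
    for topic, _subtopic, doc, rel in diversity_qrels:
        key = (topic, doc)
        adhoc[key] = max(adhoc.get(key, 0), 1 if rel > 0 else 0)
    return adhoc


# d1 is relevant for one of two sub-topics, d2 for none.
rows = [
    (1, 1, "d1", 0),
    (1, 2, "d1", 1),
    (1, 1, "d2", 0),
    (1, 2, "d2", 0),
]
adhoc_qrels = diversity_to_adhoc_qrels(rows)
```

The resulting mapping can be written out in the standard qrels layout and evaluated with trec_eval to obtain nDCG@10 and P@10.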

We see that the term filter leads to a drop in performance for all baseline runs on almost all
measures; the only exception is a marginal gain in α-nDCG@10 on the Text run (from 0.120 to
0.122). The number of unseen terms seems ineffective as a feature to diversify search results.
The link filter leads to better scores on both the traditional Adhoc measures and the
Diversity measures. Overall, the incoming links are more effective than the outgoing links,
although in combination with the merged RF(10) run, the outgoing links are slightly more
effective for P@10 and IA-P@10. The feedback run RF(10) also improves the document ranking and
diversity of the baseline run. On the Anchor run, the filters are not effective. Of course,
the Anchor run already uses the number of incoming links implicitly through the length prior.
Further boosting documents with many new incoming or outgoing links only hurts performance. By
combining the anchor text and full-text runs, we get a slight improvement on IA-P@10. If we
then apply the link filters, the P@10 and IA-P@10 scores go up further. The incoming links are
more effective than the outgoing links.

It is hard to judge whether the diversity methods actually affect the diversity of the
baseline runs. If we compare the scores for the ad hoc measures nDCG@10 and P@10 with the
diversity measures α-nDCG@10 and IA-P@10, we see similar patterns. Runs that score higher on
nDCG@10 also score higher on α-nDCG@10 and runs that score higher on IA-P@10 also score higher
on P@10. This suggests that the changes in the diversity scores do not reflect changes in
actual diversity. The link filters seem to merely work as indegree priors and push up
important documents. Ad hoc precision goes up a lot but diversity goes up only a little bit.
The run is not more diverse but simply has more relevance in the top ranks.

To shed some more light on how our methods affect the diversity of the results, we look at the
percentage of sub-topics for which relevant documents are found. In Table 5 we show the
percentage (macro average) of sub-topics covered by the retrieved results at various rank
cut-offs. In the relevance judgements we find relevant documents for 199 different sub-topics
for 49 topics. This means that for one of the 50 topics, not a single document in the pool was
judged relevant for any of the chosen sub-topics. We see that the top 10 documents of the Text
run contain relevant documents for only 16.3% out of the 199 sub-topics while the top 10 of
the Anchor run covers 28.5%. The anchor text run is thus not only more precise, but also more
diverse. The term filter has a small negative impact on the number of sub-topics found, while
the link filters have a positive impact, except for the Anchor run. The outlink filter brings
more sub-topics into the top ranks than the inlink filter. The merged query expansion runs
make the top ranked results more diverse, showing that the improvements for the diversity
measures in Table 4 are not only based on higher precision. Combining the Text and the Anchor
runs has almost no impact on the number of sub-topics covered in the top ranks of the baseline
run. For this run, the inlink filter is more effective than the outlink filter. If we look
further down the ranking, we see that relevant documents for many more sub-topics are
retrieved. The impact of the diversity methods is almost negligible at rank 100 and lower. The
combination of Text and Anchor runs does increase the number of sub-topics found later in the
ranking. The Wikipedia run is far less diverse than the Web only run. The higher diversity
score must come from a better relevance ranking of the top results.
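Sub-topic coverage at a rank cut-off can be computed as in the sketch below. It follows the macro-average reading of the text, where coverage is computed per topic and then averaged over topics. The data layout (per document, the set of sub-topics it is relevant for) and all names are assumptions for illustration.

```python
def subtopic_coverage(ranking, rel_subtopics, all_subtopics, k):
    """Fraction of a topic's sub-topics with a relevant document in the top k.

    `rel_subtopics` maps a document id to the set of sub-topics it is
    relevant for; `all_subtopics` is the set of sub-topics of the topic.
    """
    found = set()
    for doc in ranking[:k]:
        found |= rel_subtopics.get(doc, set())
    return len(found & all_subtopics) / len(all_subtopics)


def macro_coverage(runs, rel_by_topic, subtopics_by_topic, k):
    """Average the per-topic coverage over all topics (macro average)."""
    per_topic = [
        subtopic_coverage(runs.get(t, []), rel_by_topic[t], subtopics_by_topic[t], k)
        for t in subtopics_by_topic
    ]
    return sum(per_topic) / len(per_topic)


# Topic 1 covers one of two sub-topics, topic 2 covers all four.
runs = {1: ["a", "b"], 2: ["x"]}
rel_by_topic = {1: {"a": {1}}, 2: {"x": {1, 2, 3, 4}}}
subtopics_by_topic = {1: {1, 2}, 2: {1, 2, 3, 4}}
coverage = macro_coverage(runs, rel_by_topic, subtopics_by_topic, k=2)
```

A micro average, which pools all sub-topics of all topics before dividing, gives a different number when topics have different numbers of sub-topics, so it matters which of the two is reported.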

Note that the sliding window filter allows documents to
move up n − 1 ranks at the most. Thus, for the top 10
documents, a sliding window of n = 10 documents can select
documents from the top 19 results of the original ranking.
The number of sub-topics found in the top 20 of the original
ranking therefore provides an upper bound on the number of
sub-topics that we can possibly have in the top 10 of the
filtered runs. The small impact of the filters is due to the low
diversity in the initial text-based relevance ranking. With only
26.1% of the sub-topics covered in the top 20 results for 49
topics (1.06 sub-topics per topic), there is not much to diversify.
For the filters to have more impact, the window size needs
to be increased to move up documents from further down
the ranking. As mentioned before, the danger is that this
leads to infiltration of off-topic documents that have many
links or are very long. The sliding window size is kept low
to broadly respect the initial text-based ranking. With larger
window sizes, the impact of the initial ranking decreases.
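
The bound of n − 1 ranks can be checked with a small sketch of a top-down sliding window filter. This is an illustration under stated assumptions rather than the code used for the runs: the window is taken to span the next n documents that have not yet been placed, ties are broken in favour of the initial ranking, and the function name and the `features` layout (a set of links or terms per document) are hypothetical.

```python
def sliding_window_rerank(ranked_docs, features, n):
    """Re-rank `ranked_docs` with a top-down sliding window of size `n`.

    ranked_docs: doc ids in the initial text-based order
    features:    {doc_id: set of links or terms} (hypothetical layout)
    """
    remaining = list(ranked_docs)
    seen = set()      # features of the documents already placed
    reranked = []
    while remaining:
        window = remaining[:n]
        # The document adding the most unseen features wins; max() returns the
        # earliest such document, so ties respect the initial ranking.
        best = max(window, key=lambda d: len(features.get(d, set()) - seen))
        reranked.append(best)
        seen |= features.get(best, set())
        remaining.remove(best)
    return reranked


# Toy data, invented for the example.
docs = ["d1", "d2", "d3", "d4", "d5"]
inlinks = {
    "d1": {"a", "b"},
    "d2": {"a", "b"},
    "d3": {"a"},
    "d4": {"c", "d", "e"},
    "d5": {"f"},
}
result = sliding_window_rerank(docs, inlinks, n=3)
print(result)
# No document moves up more than n - 1 = 2 ranks.
print(all(docs.index(d) - result.index(d) <= 2 for d in docs))
```

With n = 1 the window contains a single document and the initial ranking is returned unchanged, which makes the trade-off explicit: a larger n gives the diversity feature more influence at the expense of the text-based ranking.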


Table 4: Results for runs using the sliding window filters and merge of multiple query expansions on the 2009 Adhoc topics.
Best scores are in bold-face.

Run                                 nDCG@10   Diversity α-nDCG@10   P@10     IA-P@10
Text                                0.1564    0.120                 0.1700   0.054
Text TF(10)                         0.1450    0.122                 0.1560   0.048
Text LF(in, 10)                     0.1924    0.154                 0.2020   0.068
Text LF(out, 10)                    0.1873    0.145                 0.2000   0.063
Text RF(10)                         0.1888    0.150                 0.2080   0.067
Text RF(10) TF(10)                  0.1536    0.123                 0.1700   0.049
Text RF(10) LF(in, 10)              0.2098    0.170                 0.2200   0.068
Text RF(10) LF(out, 10)             0.2053    0.168                 0.2260   0.069
Anchor                              0.2780    0.257                 0.2460   0.082
Anchor TF(10)                       0.2665    0.250                 0.2380   0.079
Anchor LF(in, 10)                   0.2442    0.233                 0.2060   0.066
Anchor LF(out, 10)                  0.2373    0.236                 0.2080   0.071
0.7 Text + 0.3 Anchor               0.2459    0.223                 0.2420   0.083
0.7 Text + 0.3 Anchor TF(10)        0.2363    0.209                 0.2280   0.075
0.7 Text + 0.3 Anchor LF(in, 10)    0.2719    0.244                 0.2640   0.090
0.7 Text + 0.3 Anchor LF(out, 10)   0.2593    0.229                 0.2540   0.086

Table 5: Percentage of sub-topics (macro average) for which
at least one relevant document is found at different rank
cut-offs.

Run                         Top 10   Top 20   Top 100   Top 1000
Text                        16.3     26.1     41.0      51.4
Text TF(10)                 16.8     23.5     40.6      51.4
Text LF(in, 10)             19.4     26.6     40.6      51.4
Text LF(out, 10)            20.3     29.2     40.7      51.4
Text RF(10)                 21.4     27.4     41.3      51.3
Text RF(10) TF(10)          18.4     27.2     41.4      51.3
Text RF(10) LF(in, 10)      22.0     33.0     40.9      51.3
Text RF(10) LF(out, 10)     23.3     33.3     41.4      51.3
Anchor                      28.5     34.2     44.7      52.0
Anchor TF(10)               27.2     33.7     43.9      52.0
Anchor LF(in, 10)           25.9     32.6     45.2      52.0
Anchor LF(out, 10)          28.2     32.2     44.7      52.0
Text + Anchor               27.2     34.8     50.2      59.3
Text + Anchor TF(10)        25.3     32.7     50.5      59.3
Text + Anchor LF(in, 10)    29.4     37.1     50.5      59.6
Text + Anchor LF(out, 10)   27.8     35.4     50.1      59.6
Web only                    15.1     24.8     40.9      50.4
Wikipedia                   8.7      8.7      11.1      12.6

4  Entity  Ranking 

For the entity ranking track, we have experimented with different
approaches, which are discussed in this section. We
use anchor text representations (assuming the entity's name
will be frequent in incoming anchors); co-citations (assuming
similar entities will receive similar incoming links); and
Wikipedia as a pivot (assuming entities have unique
Wikipedia pages, which are neatly organized and may contain
external links toward the most suitable homepage).

4.1  Anchor  Text 

Our first approach tries to apply an ad hoc retrieval method
to the task of related entity finding. We use the ClueWeb
anchor text index that is described in Section 2. Queries
consist of the concatenation of the entity name and the
narrative. The initial result ranking is in the ad hoc format. To
convert the results to the entity ranking format, we use a very
naive approach. In our official run, the first 300 results of the
initial ranking are grouped into groups of three. Each result
entity consists of a group of three pages, where each page is
treated as an entity homepage. If Wikipedia results occur in the
initial ranking, they are added to the result entities ordered by
score. We also show additional results, where each result entity
consists of only one web page and one Wikipedia page.
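
The conversion can be sketched as follows. This is a minimal illustration, assuming the initial ranking is a list of (URL, score) pairs sorted by score. The function name, the way Wikipedia pages are recognised, and the way they are attached to result entities (the i-th Wikipedia page in score order goes to the i-th entity) are assumptions made for the example, since the description above leaves these details open.

```python
def to_entity_results(ranking, group_size=3, max_results=300):
    """Group an ad hoc ranking into entity results.

    ranking: [(url, score), ...] sorted by descending score (hypothetical layout)
    Returns a list of dicts with `homepages` and an optional `wikipedia` page.
    """
    top = ranking[:max_results]
    # Assumption: Wikipedia pages are recognised by their host name.
    wiki = [url for url, _ in top if "wikipedia.org" in url]
    web = [url for url, _ in top if "wikipedia.org" not in url]
    entities = []
    for i in range(0, len(web), group_size):
        entities.append({"homepages": web[i:i + group_size], "wikipedia": None})
    # Assumption: Wikipedia pages are attached to entities in score order.
    for entity, page in zip(entities, wiki):
        entity["wikipedia"] = page
    return entities


# Toy ranking with placeholder URLs.
ranking = [
    ("http://a.example.com/", 9.0),
    ("http://en.wikipedia.org/wiki/X", 8.0),
    ("http://b.example.com/", 7.0),
    ("http://c.example.com/", 6.0),
    ("http://d.example.com/", 5.0),
]
for entity in to_entity_results(ranking):
    print(entity)
```

Calling the function with `group_size=1` gives the variant in which each result entity holds a single web page.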

4.2  Co-citations 

For the Entity Ranking topics an example relevant entity is
provided. Given the large link graph of the ClueWeb
collection, we want to exploit co-citation information to find
entities similar to the example entity. For this, we first find
the set S of all pages s that link to the example entity e. For
each page s, we consider all outgoing links as pointers to
pages t about possibly similar entities. The number of pages
in S that link to a target page t is the co-citation frequency of
t and e. The more often t and e are co-cited, the more similar
we assume them to be. We consider the links from pages with a
small number of outgoing links to be more valuable than links
from pages with a high outgoing link degree. Thus, we weight
each link from a page s to page t by the inverse of the outgoing
link degree of s. More formally, the similarity score between a
target entity t and example entity e is given by:


$$\mathrm{sim}(t, e) = \sum_{s} l(e \leftarrow s) \cdot \frac{l(s \rightarrow t)}{\mathrm{outdegree}(s)} \qquad (3)$$


where l(s → t) is 1 if there is a link from s to t and 0
otherwise, and l(e ← s) is likewise 1 if s links to e and 0
otherwise. The entities are then ranked by their similarity
score sim(t, e).
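
To make the weighting concrete, the following minimal sketch computes this similarity score on a toy link graph. The function name and the `outlinks` layout are hypothetical. It also assumes that outdegree(s) counts distinct outgoing links and that the example entity itself is excluded from the candidates; neither detail is specified above.

```python
from collections import defaultdict


def cocitation_scores(outlinks, example):
    """Score target pages by weighted co-citation with `example` (Equation 3).

    outlinks: {source page: set of pages it links to} (hypothetical layout)
    """
    scores = defaultdict(float)
    for source, targets in outlinks.items():
        if example not in targets:       # l(e <- s) = 0: s does not cite e
            continue
        weight = 1.0 / len(targets)      # 1 / outdegree(s)
        for target in targets:
            if target != example:        # assumption: e is not its own candidate
                scores[target] += weight
    # Highest score first; ties are ordered by page id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy link graph, invented for the example.
outlinks = {
    "s1": {"e", "t1"},
    "s2": {"e", "t1", "t2", "t3"},
    "s3": {"t2", "t3"},
}
print(cocitation_scores(outlinks, "e"))
```

Here `t1` is co-cited with `e` by both `s1` (weight 1/2) and `s2` (weight 1/4), while `s3` contributes nothing because it does not link to `e`.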


4.3  Wikipedia 

Our  last  approach  exploits  the  information  in  Wikipedia.  To 
complete  the  task  of  related  entity  finding,  we  take  a  number 
of  steps. 

1. Rank all Wikipedia pages according to their match to
the entity name and narrative.

2. Scores of Wikipedia pages which belong to the correct
target category (i.e. Persons, Products or Organizations)
are boosted.

3. To find primary result pages, we follow the external
links on the Wikipedia page to find matches with the
ClueWeb Category B URLs.

The second step is optional. We have made two official
runs: excluding (Wiki Base) and including (Wiki Cats) the
second step. More detail on the category mappings used in
the second step follows below. In our official runs, in the
third step all Wikipedia pages without matches to the
Category B URLs are dropped from the ranking. We made an
additional run where Wikipedia pages without ClueWeb links
are retained in the ranking and a dummy ClueWeb page is
inserted in the result to give them the right format.

4.3.1 Category Mapping

In the Wikipedia context we consider each Wikipedia page
as an entity. The Wikipedia page title is the label or name of
the entity. Currently in the English part of Wikipedia there
are over 3 million pages. Wikipedia employs a fine-grained
categorisation system, consisting of more than 70,000
categories. Each page is categorised into at least one category.
The categories form a hierarchical structure, but because
subcategories can have more than one parent, the structure
as a whole is not a tree, but rather a directed acyclic graph.

In the entity ranking track only three high level types of
entities are used: persons, products and organisations.
‘Persons’ is a clearly defined concept. Organisations and
products on the other hand are less clearly defined. In the training
topics certain groups of people, e.g. a band, or more abstract
concepts like ‘Motorsport series that Bridgestone officially
supports with tyres’ are included as organisations. A problem
with the ‘Products’ entity type is the granularity: different
versions of a product might have their own homepage,
which undesirably makes each version eligible as an entity.

To map the entity types to Wikipedia categories, we
experiment with the following approach. We manually map
a number of lower level Wikipedia categories to each entity
type. Each document gets a binary score: either the
document categories include one of the target categories or
not. All documents including one of the target categories
are ranked above all documents not including one of the
target categories. The entity types are mapped to the following
categories:

• Persons
  - ‘Living People’
  - Ending with ‘births’
  - Ending with ‘deaths’
  - Starting with ‘People’

• Organizations
  - Starting with ‘Organizations’
  - Starting with ‘Companies’

• Products
  - Starting with ‘Products’
  - Ending with ‘introductions’
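
The category rules listed above can be expressed as a small matcher followed by a stable re-ranking on the binary score. This is a sketch under assumptions, not the implementation behind the Wiki Cats run: the names `RULES`, `matches_type` and `rerank_by_type` are hypothetical, matching is case-insensitive by assumption, and the page ids and category lists are invented.

```python
# entity type -> (exact names, prefixes, suffixes), all lower-cased
RULES = {
    "Persons": (("living people",), ("people",), ("births", "deaths")),
    "Organizations": ((), ("organizations", "companies"), ()),
    "Products": ((), ("products",), ("introductions",)),
}


def matches_type(categories, entity_type):
    """Binary score: True if any category of the page matches a rule for the type."""
    exact, prefixes, suffixes = RULES[entity_type]
    for category in categories:
        name = category.strip().lower()
        # str.startswith / str.endswith accept a tuple of alternatives.
        if name in exact or name.startswith(prefixes) or name.endswith(suffixes):
            return True
    return False


def rerank_by_type(ranked_pages, page_categories, entity_type):
    """Put matching pages above non-matching ones, keeping the original order
    within each group (sorted() is stable)."""
    return sorted(
        ranked_pages,
        key=lambda page: not matches_type(page_categories.get(page, []), entity_type),
    )


# Invented pages and categories.
pages = ["p1", "p2", "p3", "p4"]
page_categories = {
    "p1": ["Car manufacturers"],
    "p2": ["1969 births", "Living people"],
    "p3": ["Companies based in Tokyo"],
    "p4": ["People from Amsterdam"],
}
print(rerank_by_type(pages, page_categories, "Persons"))
print(rerank_by_type(pages, page_categories, "Organizations"))
```

Because the sort is stable, the text-based order is preserved inside the matching and the non-matching group, which corresponds to the binary score described above.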

4.4  Results 

In this section we discuss the results of our official runs and
some additional runs. We report the results in the official
measures of the track: NDCG@R and P@10. The official
NDCG@R score also credits relevant pages, i.e. pages
that are related to the query topics without being actual
homepages for the entities [1]. We have calculated another
NDCG@R score that credits only the primary pages, i.e.
the (authoritative) entity homepages. Each evaluation
measure is calculated for Wikipedia pages only, homepages only,
and for their combination where both Wikipedia pages and
other homepages are considered. In the assessments we have
substituted redirected Wikipedia pages with the single
non-redirected page. All results are presented in Table 6.

Table 6: Entity Ranking Results (WP: Wikipedia pages, HP: homepages)

                               Anchor Text                 Co-citations   Wiki Base                  Wiki Cats
Evaluation   Measure   Pages   Groups of 3   Groups of 1                  Only links   Dummy pages   Only links
Primary      # Pages   WP      22            22            43             56           81            57
Primary      # Pages   HP      19            18            23             40           22            40
Primary      # Pages   All     41            40            65             96           101           97
Primary      P@10      WP      0.0300        0.0300        0.0200         0.1000       0.1200        0.1550
Primary      P@10      HP      0.0450        0.0100        0.0400         0.0500       0.0300        0.0550
Primary      P@10      All     0.0700        0.0350        0.0600         0.1200       0.1250        0.1650
Primary      NDCG@R    WP      0.0427        0.0427        0.0246         0.0896       0.1090        0.1091
Primary      NDCG@R    HP      0.0495        0.0211        0.0515         0.0746       0.0319        0.0465
Primary      NDCG@R    All     0.0685        0.0436        0.0611         0.1059       0.1125        0.1138
All          NDCG@R    WP      0.1646        0.1653        0.0504         0.1762       0.1977        0.1665
All          NDCG@R    HP      0.1773        0.1625        0.1265         0.1043       0.0880        0.0805
All          NDCG@R    All     0.1820        0.1828        0.1397         0.1328       0.1425        0.1187

The Wikipedia-based runs retrieve the most primary pages,
homepages as well as Wikipedia pages. The Wikipedia
base run with only links throws out a number of primary
Wikipedia pages that do not have a link to a ClueWeb page.
The run with the dummy ClueWeb pages finds more primary
Wikipedia pages, but fewer primary homepages are found, because
of the insertion of dummy pages. Also, since the run with
the dummy pages is not an official run, more pages are
unjudged.

The difference between the anchor text runs with groups
of 3 pages in one result, and one page in each result, is very
small. The run with groups of 3 has a higher P@10, as more
relevant results are found among the top ranked documents.
However, since a result with more than one relevant page is
rewarded the same as a result with just one relevant page,
some relevant pages in the grouped run will be redundant
and not rewarded.

The Co-citations run is the only run that makes use of the
given entity in the topic to which the result entities should
be related. By calculating similarity scores to the given
entity we are able to find a reasonable number of primary
Wikipedia pages and homepages. The link information does
provide useful information on the relationships between
entities. In future work we would like to investigate whether this
link information can be combined with the Wikipedia-as-a-pivot
approach to achieve additional improvement.

Looking at P@10, the Wikipedia run that reranks pages
according to their categories scores best. When we compare
the base Wikipedia run and the category run, 13 topics get
the same score, on 3 topics the base run is best, and on 4
topics the category run scores best. With the small number
of topics in the test set, the average score can be influenced
by just a few topics. Another issue is that 13 out of the 20
topics have fewer than 10 primary Wikipedia pages, and 14
out of the 20 topics have fewer than 10 primary homepages.
So for these topics the maximum attainable P@10 is already
below 1.
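
The ceiling on P@10 follows directly from the number of primary pages available for a topic. A minimal sketch, with a hypothetical function name and invented per-topic counts (not the counts of the test set):

```python
def max_precision_at_k(num_primary, k=10):
    """Upper bound on P@k for a topic with only `num_primary` primary pages."""
    return min(num_primary, k) / k


# Invented counts of primary pages for three topics.
counts = [4, 10, 25]
bounds = [max_precision_at_k(c) for c in counts]
print(bounds)
print(sum(bounds) / len(bounds))  # ceiling on the average P@10 over these topics
```

A topic with 4 primary pages can reach a P@10 of at most 0.4, so the average over a topic set in which many topics have fewer than 10 primary pages is bounded well below 1.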

Looking at the official measure of the task, NDCG@R
evaluated on web pages, our anchor text run outperforms
all other runs, even though it finds the smallest number of
primary pages. The anchor text run finds a large number
of relevant pages, which is why it performs so well on this
measure. We are more interested, however, in the retrieval
of primary pages only. Looking at the NDCG@R scores for
primary homepages, none of our runs scores very well. The
baseline Wikipedia run with only links scores best with an
NDCG@R of 0.0746. When we also credit Wikipedia pages,
the scores improve somewhat.

Besides evaluating the runs on the official test data, we
evaluate some separate steps in our approach to analyse
where our approach could be improved. For our runs we
discarded the given entity information. The given entities
were identified by non-Wikipedia homepages. Since we
want to exploit the structured information in Wikipedia, this
was not very helpful. So, instead we try to match the given
entity to a Wikipedia page. We try two approaches: we use
the entity name as a query to the full text Wikipedia index
and retrieve the top results, and we do exact string matching,
where all characters are lower cased and special characters
are removed. Results can be found in Table 7.
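
The exact string matching step can be illustrated with a short sketch. It assumes that "special characters" means everything except ASCII letters, digits and spaces, and that runs of whitespace are collapsed; the function names and the example titles are hypothetical.

```python
import re


def normalise(name):
    """Lower-case `name` and drop characters other than a-z, 0-9 and space."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())  # collapse repeated whitespace


def match_entity(entity_name, page_titles):
    """Return the titles whose normalised form equals the normalised entity name."""
    target = normalise(entity_name)
    return [title for title in page_titles if normalise(title) == target]


# Invented titles.
titles = ["Acme Corp.", "ACME corp", "Acme Corporation"]
print(match_entity("acme corp", titles))
```

Only titles that are identical after normalisation are returned, which is consistent with the high precision and limited coverage of string matching reported in Table 7.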


Table 7: Entity Finding Results

Results      Index   String Matching
None         0       7
Irrelevant   5       1
Relevant     13      0
Primary      2       12

Besides the 12 exact string matches, 3 more primary
Wikipedia pages can be found with a less strict matching
algorithm, so for 15 out of the 20 entities we can find a
primary page in Wikipedia. The top retrieved result from the
index returns a relevant result for the majority of topics, but
only a few primary pages are identified. Using exact string
matching, we can find primary Wikipedia pages with high
precision for a majority of the topics. Using the links and
categories in Wikipedia from and to the primary homepages
can provide additional information that can help solve the
entity relationship search task.

Finally, we want to evaluate the last step in our Wikipedia-based
approach, i.e. finding a primary homepage on the web
for a primary Wikipedia page. To measure how well we do
here, we use the 15 primary Wikipedia pages in combination
with the given primary homepages in the topic specification
as our test set. To find primary homepages we look at two
specific parts of the Wikipedia page: the website specified in
the ‘Infobox’, and the links specified in the ‘External links’
section. For 5 entities both the ‘Infobox’ website and the
first ‘External link’ point to the given entity site. For 2
entities the ‘Infobox’ website is correct, and for 5 more entities
the ‘External links’ point to the correct site. In all but one
case the first external link is the correct link. Only 2
entities do not have any link to the given entity site. Although
this test set is small, our performance is promising: for the
large majority of the pages we can find a correct link to a
homepage on the Web. Our Wikipedia runs on the entity
ranking task find many more primary Wikipedia pages than
primary pages on the Web. This has two reasons. First of
all, the ClueWeb Category B collection is of a limited size,
and does not include a large part of the pages linked to from
Wikipedia. When the complete ClueWeb collection is
considered, this problem could become significantly smaller.
Secondly, a lot of the Web homepages are not judged,
especially the pages in our unofficial runs. Judging more pages
would make this test set more reusable.
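
The homepage lookup evaluated above can be sketched as follows. This is an illustration only: the function names are hypothetical, the URL normalisation (ignoring scheme, a leading `www.` and a trailing slash) is an assumption, and so is the choice to try the ‘Infobox’ website before the ‘External links’ in their listed order. The URLs are placeholders.

```python
from urllib.parse import urlsplit


def normalise_url(url):
    """Reduce a URL to host + path so that trivial variants compare equal."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def find_homepage(infobox_url, external_links, collection_urls):
    """Return the first candidate link that is present in the collection, or None."""
    known = {normalise_url(u): u for u in collection_urls}
    # Assumption: the Infobox website is tried first, then the external links.
    candidates = ([infobox_url] if infobox_url else []) + list(external_links)
    for url in candidates:
        hit = known.get(normalise_url(url))
        if hit is not None:
            return hit
    return None


collection = ["http://example.org/", "http://www.example.com/products"]
print(find_homepage("http://www.example.org", ["http://example.net"], collection))
print(find_homepage(None, ["http://example.net/", "http://example.com/products/"], collection))
```

A page whose links all fall outside the collection yields `None`, which corresponds to the Wikipedia pages that are dropped in the "only links" runs or paired with a dummy page in the additional run.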

5  Conclusions 

In this paper, we detailed our official runs for the TREC 2009
Web Track and Entity Ranking Track and performed an
initial analysis of the results. We now summarize our
preliminary findings.

We experimented with indexes of different document
representations and a sliding window filter to combine
text-based ranking with diversity features. Assuming a user starts
reading the results list from the top and has seen the first m
documents, we choose from documents m + 1 to m + n in
the text-based ranking the one that has the highest diversity
score using some feature, add it to the final results list at
rank m + 1 and slide down the window to ranks m + 2 to
m + n + 1. As diversity features we consider the number
of incoming links not seen in higher ranked results and the
number of distinct terms not seen in higher ranked results.

For the initial text-based run, anchor text is very effective:
it has more relevant documents in the top 20 ranks
than standard full-text runs, and these documents cover more
diverse aspects of the search topic. The sliding window filter
shows that link information is more effective than the number of
unseen words to diversify retrieval results. The exception is
the anchor text run, which already implicitly uses link
information through the length prior. For runs using the document
text, or a combination of document text and anchor text, the
incoming link filter increases the number of sub-topics
covered by the top ranked results.

The initial document text-based run covers 0.84 sub-topics
in the top 10 and 1.34 sub-topics in the top 20, on average.
With a sliding window of size 10, which allows results to
move up 9 ranks at the most, the lack of diversity in the
top 20 limits the impact the sliding window filter can have
on the diversity. To have more impact, the size of the
window could be increased, but with such low precision scores,
this also increases the chances of infiltration of very long or
highly connected but off-topic pages. As the size of the
window increases, the impact of the initial text-based ranking
decreases. The impact of window size will be addressed in
future research.

For entity ranking we experimented with three
approaches: using anchor text representations (assuming the
entity's name will be frequent in incoming anchors); using
co-citations (assuming similar entities will receive similar
incoming links); and using Wikipedia as a pivot (assuming
entities have unique Wikipedia pages, which are neatly
organized and may contain external links toward the most
suitable homepage).

From our experiments we can draw the following
conclusions. Anchor text works well for finding relevant Web
pages, but not so well for finding primary Web pages. The
link information used to make the co-citations run does
provide clues to find primary homepages, but just estimating
similarity scores for the given entities is not sufficient.
Using Wikipedia as a pivot works well for finding primary
Wikipedia pages. Additionally, the links to the ClueWeb
collection from the ‘Infobox’ and ‘External links’ section
of Wikipedia may be sparse, but the precision of the linked
ClueWeb pages is very high. Using the high level category
information leads to improvements mainly in early
precision.

From this first year of the Entity Ranking track we learn
two things. First, link information is very important: anchor
text can be used to find relevant pages, co-citations can be used
to find similar entities, and links from Wikipedia to the Web can
be used to find primary homepages. Secondly, Wikipedia is an
excellent entity repository for this task. It covers a large
range of possible entity ranking topics, and its structure can
be used to effectively rank entities.